Separating characters in dfm object R

Damon C. Roberts

All,

I have imported the sotu corpus from quanteda.corpora in R. I am somewhat new to dfm objects and want to split the doc_id column into a name column and a year column. If this were a tibble, the following code would work:

library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- as_tibble(data_corpus_sotu)
sotusubsetted <- sotu %>%
   separate(doc_id, c("name","year"),"-")

However, since I am new to dfm objects and regex, I am not sure whether there is an equivalent process when I load the data as:

library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- corpus(data_corpus_sotu)
sotudfm <- dfm(sotu)

Is there some equivalent way to do this with dfm objects?

Ken Benoit

The safest method is also one that works for any core quanteda object: a corpus, a tokens object, or a dfm. It relies on the accessor functions rather than addressing the internals of the corpus or dfm objects directly, which is strongly discouraged. You can address the internals, but your code could break in the future if those object structures change. In addition, our accessor functions are generally also the most efficient method.

For this task, you want to use the docnames() function for accessing the document IDs, and this works for the corpus as well as for the dfm.

library("quanteda")
## Package version: 2.1.2

data("data_corpus_sotu", package = "quanteda.corpora")

data.frame(doc_id = docnames(data_corpus_sotu[1:5])) %>%
  tidyr::separate(doc_id, c("name", "year"), "-")
##         name  year
## 1 Washington  1790
## 2 Washington 1790b
## 3 Washington  1791
## 4 Washington  1792
## 5 Washington  1793

data.frame(doc_id = docnames(dfm(data_corpus_sotu[1:5]))) %>%
  tidyr::separate(doc_id, c("name", "year"), "-")
##         name  year
## 1 Washington  1790
## 2 Washington 1790b
## 3 Washington  1791
## 4 Washington  1792
## 5 Washington  1793
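Note that the second document ID, `Washington-1790b`, leaves a `year` value that is not purely numeric. If a numeric year is needed, the split can also be done in base R. The following is only a sketch: `split_doc_id()` is a hypothetical helper (not part of quanteda), and it assumes every ID has the form `name-year` with an optional trailing letter.

```r
# Hypothetical helper, not a quanteda function: split IDs such as
# "Washington-1790b" into name, integer year, and any trailing suffix.
split_doc_id <- function(ids) {
  # Everything before the first hyphen is the name
  name <- sub("-.*$", "", ids)
  # Everything after the first hyphen is the year plus optional suffix
  rest <- sub("^[^-]*-", "", ids)
  data.frame(
    name   = name,
    year   = as.integer(sub("[^0-9]+$", "", rest)),  # drop trailing non-digits
    suffix = sub("^[0-9]+", "", rest),               # keep what follows the digits
    stringsAsFactors = FALSE
  )
}

split_doc_id(c("Washington-1790", "Washington-1790b", "Washington-1791"))
```

With quanteda loaded, the same helper could be applied to `docnames()` of either the corpus or the dfm.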

You could also take the name from the "President" docvar field and the year from the "Date" docvar:

data.frame(
  name = data_corpus_sotu$President,
  year = lubridate::year(data_corpus_sotu$Date)
) %>%
  head()
##         name year
## 1 Washington 1790
## 2 Washington 1790
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
## 6 Washington 1794
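If the goal is to keep the name and year attached to the dfm itself rather than in a separate data frame, one option is to store them as docvars through the `docvars()` accessor. The sketch below uses a made-up three-document corpus in place of `data_corpus_sotu` (the texts are invented for illustration) and assumes the IDs follow the `name-year` pattern with an optional trailing letter.

```r
library(quanteda)

# Toy corpus standing in for data_corpus_sotu; the texts are invented.
toy <- corpus(c(
  "Washington-1790"  = "Fellow citizens of the Senate.",
  "Washington-1790b" = "Fellow citizens of the House.",
  "Adams-1797"       = "Gentlemen of the Senate."
))

# Tokenize first, then build the dfm
toy_dfm <- dfm(tokens(toy))

# Read the document IDs through the accessor and split at the first hyphen
ids <- docnames(toy_dfm)
docvars(toy_dfm, "name") <- sub("-.*$", "", ids)
# Strip the name, then any non-digits (such as the "b" in "1790b")
docvars(toy_dfm, "year") <- as.integer(gsub("[^0-9]", "", sub("^[^-]*-", "", ids)))

docvars(toy_dfm)
```

Stored this way, the two fields stay with the dfm and can be used later for subsetting or grouping documents.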

Created on 2021-02-13 by the reprex package (v1.0.0)
