Separating characters in dfm object R

Damon C. Roberts

All,

I have imported the sotu corpus from the quanteda.corpora package in R. I am somewhat new to dfm objects and want to split the doc_id column into a name column and a year column. If the data were a tibble, this code would work:

library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- as_tibble(data_corpus_sotu)
sotusubsetted <- sotu %>%
   separate(doc_id, c("name","year"),"-")

However, since I am new to dfm objects and regex, I am not sure whether there is an equivalent process when I load the data as:

library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- corpus(data_corpus_sotu)
sotudfm <- dfm(sotu)

Is there some equivalent way to do this with dfm objects?

Ken Benoit

The safest method is one that works for any core quanteda object, whether a corpus, tokens, or dfm object: use the accessor functions rather than addressing the internals of the object directly, which is strongly discouraged. Reaching into the internals can work, but your code could break in the future if those object structures change. The accessor functions are also generally the most efficient method.

For this task, use the docnames() function to access the document IDs; it works for the corpus as well as for the dfm.

library("quanteda")
## Package version: 2.1.2

data("data_corpus_sotu", package = "quanteda.corpora")

data.frame(doc_id = docnames(data_corpus_sotu[1:5])) %>%
  tidyr::separate(doc_id, c("name", "year"), "-")
##         name  year
## 1 Washington  1790
## 2 Washington 1790b
## 3 Washington  1791
## 4 Washington  1792
## 5 Washington  1793

data.frame(doc_id = docnames(dfm(data_corpus_sotu[1:5]))) %>%
  tidyr::separate(doc_id, c("name", "year"), "-")
##         name  year
## 1 Washington  1790
## 2 Washington 1790b
## 3 Washington  1791
## 4 Washington  1792
## 5 Washington  1793
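If the goal is to keep the name and year attached to the dfm itself rather than in a separate data frame, one option is to write the split pieces back as document variables. The following is only a sketch under stated assumptions: `toy_corpus` is a made-up stand-in that mimics the "President-year" naming pattern of data_corpus_sotu, `year_num` is a hypothetical extra field, and the split assumes exactly one hyphen per document name.

```r
library(quanteda)

# Toy corpus whose document names mimic the "President-year" pattern of
# data_corpus_sotu (the texts are made up for illustration).
toy_corpus <- corpus(c(
  "Washington-1790"  = "Fellow citizens of the Senate.",
  "Washington-1790b" = "In meeting you again.",
  "Adams-1797"       = "Gentlemen of the Senate."
))
toy_dfm <- dfm(tokens(toy_corpus))

# Split each document name at the hyphen (assumes one hyphen per name).
parts <- strsplit(docnames(toy_dfm), "-", fixed = TRUE)

# Store the pieces as document variables through the accessor function.
docvars(toy_dfm, "name") <- vapply(parts, `[`, character(1), 1)
docvars(toy_dfm, "year") <- vapply(parts, `[`, character(1), 2)

# Optional: a numeric year with any letter suffix removed ("1790b" -> 1790).
docvars(toy_dfm, "year_num") <- as.integer(gsub("[^0-9]", "", docvars(toy_dfm)$year))

docvars(toy_dfm)
```

Because the values live in the docvars, they should travel with the dfm when it is subset or grouped; whether to keep the letter suffix in the year depends on whether same-year documents need to stay distinguishable.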

You could also have taken these from the "President" and "Date" docvar fields:

data.frame(
  name = data_corpus_sotu$President,
  year = lubridate::year(data_corpus_sotu$Date)
) %>%
  head()
##         name year
## 1 Washington 1790
## 2 Washington 1790
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
## 6 Washington 1794
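Document variables are carried along when a corpus is converted to tokens and then to a dfm, so the same idea should apply to the dfm directly. Below is a minimal sketch, assuming a toy corpus with "President" and "Date" docvars shaped like those in data_corpus_sotu; the texts and the object names (`toy_corpus`, `toy_dfm`, `dv`, `result`) are illustrative, and base R's format() stands in for lubridate::year() to avoid the extra dependency.

```r
library(quanteda)

# Toy corpus with docvars shaped like those of data_corpus_sotu
# (texts are made up; names and dates are illustrative values).
toy_corpus <- corpus(
  c("Washington-1790" = "Fellow citizens of the Senate.",
    "Adams-1797"      = "Gentlemen of the Senate."),
  docvars = data.frame(
    President = c("Washington", "Adams"),
    Date      = as.Date(c("1790-01-08", "1797-11-22"))
  )
)
toy_dfm <- dfm(tokens(toy_corpus))

# Read the document variables back from the dfm with the accessor.
dv <- docvars(toy_dfm)

result <- data.frame(
  name = dv$President,
  year = as.integer(format(dv$Date, "%Y"))  # four-digit year from the Date
)
result
```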

Created on 2021-02-13 by the reprex package (v1.0.0)

Edited at 2021-04-08